The following is a demonstration of how to load COVID-19 cases from some different sources:

library(tidyverse)
library(mapview)
library(sf)
library(tigris)

options(
  tigris_class = "sf"
)

We’ll start with the Nature dataset. You will need to clone the nCoV2019 repo to your own machine.

opts_knit$set(root.dir = "~/GitHub/nCoV2019") # change this path if necessary.

The repo conveniently puts the latest data in a folder called “latest_data” in a file called “latestdata.csv”. But the script below demonstrates how you can extract a filename you don’t necessarily know beforehand.

filename <- 
  list.files(path = "latest_data", full.names=T)

covid_full <- 
  read_csv(filename) # this loads the data. you should see a big dataframe in your environment.

It’s as simple as that to get the data into RStudio to work with. You’re welcome to explore your own analyses, but for this demonstration, we’ll filter down to total case maps by state and county and do some basic outputs.

Note that the dataset includes lat/long fields (use colnames(covid_full) to easily see field names in your console), as well as other geographic fields that could potentially be used to associate rows with specific geographies.

My preference is to use their lat/long fields to associate every row with a geolocated coordinate, and then determine membership in a geographic boundary using spatial joins. I consider this a more universal approach to deal with hierarchies of geography directly (in county vs. in state), and it takes advantage of the most granular possible information that might be logged. But we’ll need to check the cleanliness of this field.

covid_missing_coords <-
  covid_full %>% 
  filter(is.na(latitude) | is.na(longitude))

covid_missing_coords %>% nrow()

covid_missing_coords %>% filter(country == "United States") %>% nrow()

covid_missing_coords %>% filter(province == "California") %>% nrow()
## [1] 1133
## [1] 51
## [1] 0

The chunk above creates a subset of rows in which lat or long are missing, reports the count, and then reports the count specifically in the U.S. and in California. The missing data seems to be negligible for our purposes, but next we’ll make a version of the data that filters out these rows with missing coordinates and then convert the remaining dataframe into an sf object, where the points have been geolocated.

covid_full_w_coords <-
  covid_full %>% 
  filter(!is.na(latitude) & !is.na(longitude)) %>% 
  st_as_sf(coords = c("longitude","latitude"), crs = st_crs(4326)) # Google EPSG 4326. I guessed that the data was provided in this standard coordinated system, and using st_crs(4326) helps the function "read" the lat long information in the correct system.

# mapview(covid_full_w_coords$geometry)
# You can map at this point, but it's a lot of points and may take some time, so you can skip this. Notice I've used $ to grab just the geometry field, as mapview would take even longer to render all the other fields of information.

Next, we’ll map by state in the U.S. Note that the dataframe likely has good documentation of state under the fields “province” or “admin1”, but the script below uses st_join based on the point coordinates and Census state polygons.

states <- 
  states(cb = F, progress_bar = FALSE) %>% # This comes from tigris package. Google tigris r to see full functionality. Also note that in the first chunk, we set a tigris option to automatically load polygons as sf type.
  st_transform(4326) # tigris by default is in a different coordinate system, so we have to transform to maintain consistency with our points.

covid_by_state <-
  covid_full_w_coords %>% # Recall this already is an sf object with points
  st_join(states) %>% # This joins to state polygons and will attach information from the states object to the covid object, row by row. But using st_join() in this direction means the data is still "points"
  st_set_geometry(NULL) %>% # We won't need the point geometries anymore. Removing them here will speed up the next two steps.
  group_by(NAME) %>% # The NAME field came from the states object. Remember this is maybe redundant with state names that were already in the covid dataset, but we're just being systematic here and not assuming the data is clean.
  summarize(
    Count = n() # This will collapse the data to just 50 rows for 50 states (6 other territories too), and preserve only state names and a count of rows (cases) of covid data for each state. If you want to retain other useful data as sums, averages, etc., you'd do it here.
  ) %>% 
  right_join(
    states, # This now attaches our case counts to the states file which has the state polygons, so we can represent the counts on a country map. We're right joining just in case there's a state that doesn't have cases, but we still want to map that state as NA. right_join() preserves the size of the right side.
    by = "NAME"
  ) %>% 
  st_as_sf() # Since the object in the pipeline is not a sf object (we got rid of geometry), but now we have a geometry field that came from the right_join(), this just gets the dataset to "recognize" a geometry field in itself.

mapview(covid_by_state, zcol = "Count") # This is the easiest way to make interactive maps for a webpage, though it's not necessarily the best for customizing look. 

It’s probably immediately clear that NY has way more cases than any other state, and screws up our color range.

nature_state_count <-
  covid_by_state %>% 
  st_set_geometry(NULL) %>% 
  dplyr::select(NAME, Count) %>% 
  arrange(desc(Count))

kable(
  nature_state_count,
  caption = "Counts by state from Nature database"
)
Counts by state from Nature database
NAME Count
New York 53516
New Jersey 7618
California 6956
Florida 5363
Illinois 4914
Washington 4213
Michigan 3099
Georgia 2876
Louisiana 2792
Massachusetts 2745
Colorado 2524
Connecticut 2152
Pennsylvania 1872
Texas 1540
Indiana 1057
Tennessee 1055
Arizona 994
Ohio 986
Alabama 907
Wisconsin 819
North Carolina 773
Maryland 729
Nevada 630
Mississippi 565
Missouri 549
Virginia 525
South Carolina 488
District of Columbia 472
Utah 464
Minnesota 461
Arkansas 438
Oregon 403
Idaho 349
Iowa 342
Oklahoma 297
Kentucky 288
Maine 211
Delaware 206
New Hampshire 202
Kansas 201
Rhode Island 197
New Mexico 169
Vermont 167
Alaska 114
Montana 104
Nebraska 98
Hawaii 96
West Virginia 78
North Dakota 78
Wyoming 71
Puerto Rico 64
South Dakota 58
United States Virgin Islands NA
Commonwealth of the Northern Mariana Islands NA
Guam NA
American Samoa NA

Next we’ll filter to just a California dataset.

covid_ca <-
  covid_full_w_coords %>% 
  st_join(states) %>% 
  filter(NAME == "California")

# As a demonstration, below is an alternate (and quicker) method. You can compare the two datasets.

# covid_ca_alternate <-
#   covid_full_w_coords %>%
#   filter(admin1 == "California" | province == "California")
# 
# covid_ca %>% filter(!ID %in% covid_ca_alternate$ID) %>% nrow()
# covid_ca_alternate %>% filter(!ID %in% covid_ca$ID) %>% nrow()

# Looks like it's a perfect match.

Next we’ll map by counties, similar method as mapping by state.

ca_counties <- 
  counties("CA", cb = F, progress_bar = FALSE) %>% 
  st_transform(4326)

covid_ca_by_county <-
  covid_ca %>% 
  dplyr::select(-NAME) %>% 
  st_join(ca_counties) %>% 
  st_set_geometry(NULL) %>% 
  group_by(NAME) %>% 
  summarize(Count = n()) %>% 
  right_join(
    ca_counties, 
    by = "NAME"
  ) %>% 
  st_as_sf()
mapview(covid_ca_by_county, zcol = "Count")
nature_ca_county_count <-
  covid_ca_by_county %>% 
  st_set_geometry(NULL) %>% 
  dplyr::select(NAME, Count) %>% 
  arrange(desc(Count))

kable(
  nature_ca_county_count,
  caption = "COVID-19 cases in CA by county from Nature database"
)
COVID-19 cases in CA by county from Nature database
NAME Count
Madera 5078
Santa Clara 377
Los Angeles 361
San Diego 186
San Mateo 145
San Francisco 112
Orange 91
Alameda 90
Contra Costa 85
Sacramento 81
Marin 47
Riverside 36
San Joaquin 30
Ventura 24
Sonoma 24
Santa Cruz 24
San Luis Obispo 24
Solano 20
Placer 19
Tulare 14
San Bernardino 11
Santa Barbara 10
San Benito 9
Fresno 8
Stanislaus 8
Monterey 7
Yolo 7
Imperial 6
Calaveras 4
Kern 4
Humboldt 3
Shasta 2
El Dorado 2
Nevada 2
Amador 2
Mendocino 1
Mono 1
Yuba 1
Sierra NA
Kings NA
Mariposa NA
Lassen NA
Napa NA
Trinity NA
Inyo NA
Tuolumne NA
Alpine NA
Colusa NA
Del Norte NA
Modoc NA
Tehama NA
Butte NA
Merced NA
Sutter NA
Lake NA
Plumas NA
Siskiyou NA
Glenn NA

See something strange?

You should notice a specific county with abnormally high numbers. If you dig into the data, you’ll notice that thousands of cases in Madera County are at the “geo_resolution” level of “admin1”, or state, which would imply that these are CA cases not documented as attached to a specific county, and therefore just given the lat/long of CA’s centroid, which happens to be in Madera County. It’s a separate issue to resolve how to map these cases, but for the meantime, the previous map and table are corrected below.

covid_ca_by_county_corrected <-
  covid_ca %>% 
  filter(geo_resolution != "admin1") %>% # Remove those CA-general cases
  dplyr::select(-NAME) %>% 
  st_join(ca_counties) %>% 
  st_set_geometry(NULL) %>% 
  group_by(NAME) %>% 
  summarize(Count = n()) %>% 
  right_join(
    ca_counties, 
    by = "NAME"
  ) %>% 
  st_as_sf()

mapview(covid_ca_by_county_corrected, zcol = "Count")
nature_ca_county_count_corrected <-
  covid_ca_by_county_corrected %>% 
  st_set_geometry(NULL) %>% 
  dplyr::select(NAME, Count) %>% 
  arrange(desc(Count))

kable(
  nature_ca_county_count_corrected,
  caption = "COVID-19 cases in CA by county from Nature database (corrected)"
)
COVID-19 cases in CA by county from Nature database (corrected)
NAME Count
Santa Clara 377
Los Angeles 361
San Diego 186
San Mateo 145
San Francisco 112
Orange 91
Alameda 90
Contra Costa 85
Sacramento 81
Marin 47
Riverside 36
San Joaquin 30
Ventura 24
Sonoma 24
Santa Cruz 24
San Luis Obispo 24
Solano 20
Placer 19
Tulare 14
San Bernardino 11
Santa Barbara 10
San Benito 9
Fresno 8
Stanislaus 8
Monterey 7
Yolo 7
Imperial 6
Calaveras 4
Kern 4
Madera 4
Humboldt 3
Shasta 2
El Dorado 2
Nevada 2
Amador 2
Mendocino 1
Mono 1
Yuba 1
Sierra NA
Kings NA
Mariposa NA
Lassen NA
Napa NA
Trinity NA
Inyo NA
Tuolumne NA
Alpine NA
Colusa NA
Del Norte NA
Modoc NA
Tehama NA
Butte NA
Merced NA
Sutter NA
Lake NA
Plumas NA
Siskiyou NA
Glenn NA